Objectives

This practical will show you how to run a phylogenetic pipeline from consensus sequences including creating an alignment, specifying a suitable outgroup for rooting, inferring a phylogeny using the distance-based neighbour-joining method and using a maximum-likelihood approach, resolving alignment issues and using bootstrapping to quantify uncertainty.

We will be using the ape (Analysis of Phylogenetics and Evolution) and phangorn packages for this practical, which have several tools for manipulating DNA sequence data in R and carrying out phylogenetic inference. The first step is to make sure you have these installed, as well as the other packages we need, using install.packages().

Load packages

require(data.table)
require(knitr)
# phylo packages
require(ape)
require(ade4)
require(phangorn)
require(phytools)
# packages for plotting
require(ggplot2)
require(ggtree)
require(ggsci)
require(viridis)

Data

Download the consensus sequences for this practical from github.com/MLGlobalHealth/aims_rwanda_2024, which are a subset of partial polymerase (pol) sequences from HIV positive individuals from the Democratic Republic of Congo from this study by Faria et al, 2019, tracing the evolutionary dynamics of major HIV-1 subtypes across Central and Eastern Africa. The data were obtained from the Los Alamos HIV sequence database, by searching for the GenBank Accession numbers listed in the manuscript and selecting a subset of subtypes for the purpose of this practical.

To view the alignemt, you can download an alignment viewer such as AliView (requires download) or jalview (web interface). Here is a snapshot of part of the alignment:

# TODO: update these directories accordingly
data.dir <- "~/Documents/GitHub/aims_rwanda_2024/day1/practical3_phylo"
out.dir <- "~/Documents/GitHub/aims_rwanda_2024/day1/practical3_phylo"

knitr::include_graphics(file.path(data.dir, "faria_2019_alignment.png"))

The taxa names are named uniformly in the Los Alamos HIV sequence database, with the format <Subtype.Isocode.Year.Name.Accession>.

Read in the alignment

Your first task is to load the multiple sequence alignment and have a look at its structure.

Read in the .fasta file. Note the format of FASTA - chevron denotes the taxon name. Gaps ‘-’ are indel events (insertions/deletions). ‘N’ represents unknown characters.

How many sequences do we have and how long are they?

# read data
filename <- file.path(data.dir,'faria_2019.fasta')

dat <- read.dna(file = filename, format = "fasta")
dat
## 55 DNA sequences in binary format stored in a matrix.
## 
## All sequences of same length: 1080 
## 
## Labels:
## G.CD.2008.DRC319.MN178973
## G.CD.2008.DRC378.MN179012
## G.CD.2008.DRC358.MN178994
## G.CD.2008.DRC685.MN179134
## G.CD.2008.DRC341.MN178978
## G.CD.2008.DRC752.MN179167
## ...
## 
## Base composition:
##     a     c     g     t 
## 0.391 0.161 0.207 0.241 
## (Total: 59.4 kb)

Check the format of the data - the ape package in R handles objects with class DNAbin.

class(dat)
## [1] "DNAbin"

look at positions 1:10 in the alignment of the first 5 taxa

as.character(dat)[1:5, 1:10]
##                           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## G.CD.2008.DRC319.MN178973 "c"  "c"  "c"  "t"  "t"  "a"  "g"  "c"  "c"  "t"  
## G.CD.2008.DRC378.MN179012 "c"  "c"  "c"  "t"  "t"  "a"  "a"  "c"  "t"  "t"  
## G.CD.2008.DRC358.MN178994 "c"  "c"  "c"  "t"  "t"  "a"  "g"  "c"  "t"  "t"  
## G.CD.2008.DRC685.MN179134 "c"  "c"  "t"  "t"  "t"  "a"  "a"  "c"  "t"  "t"  
## G.CD.2008.DRC341.MN178978 "c"  "c"  "t"  "t"  "t"  "a"  "g"  "c"  "t"  "t"

We can also visualise these data in R, re-coding ambiguous sites as ‘others’. The x-axis is the position in the alignment, but this is not meaningful unless it is mapped to some known genomic coordinates. The y-axis are the taxa. We show the first 30 taxa only.

par(mar=c(4,5,4,2))
ape::image.DNAbin(dat[1:30,], cex.lab = 0.4, cex.axis = 0.7)

Generate the genetic distance matrix

Generate the distance matrix using the function dist.dna() from the ape package, assuming the Tamura and Nei model for the evolutionary rate, which has two parameters allowing for different base frequencies and different transition and substitution rates. This function generates a matrix of pairwise distances between taxa. Plot the matrix.

Note that entries \(i=j\) have a genetic distance of zero.

# calculate the pairwise genetic distances using Tamura and Nei model
D <- dist.dna(dat, model = "TN93")

# plot the resulting distance matrix
temp <- as.data.frame(as.matrix(D))
temp <- t(as.matrix(D))
temp <- temp[, ncol(temp):1]
par(mar = c(1, 5, 5, 1))

# melt the matrix to a data.table 
tmp <- data.table(reshape2::melt(temp))
# ensure the ordering of taxa are the same for plotting the matrix
tmp[, Var2:= factor(Var2,levels=levels(Var1))]

ggplot(tmp) + geom_tile(aes(x=Var1,y=Var2,fill=value)) +
  scale_fill_viridis(option="magma",direction=-1) + 
  labs(x='',y='',fill='Genetic distances\n(Tamura and Nei model)') +
  theme_bw(base_size=6) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

There are two types of HIV, HIV-1 and HIV-2, which can further be classified by subtype. All the sequences in this study are from individuals with HIV-1, which is the most prevalent type of HIV globally. The taxa labels indicate subtype. Subtypes are an indication of virus similarity. Which sequences have the smallest distances?

We find that sequences of the same subtype typically have smaller genetic distances than those of different subtypes.

Build the tree

We would now like to visualise the relationship between these sequences in a more informative way by building a phylogeny.

We can use the ape package in R, which has a few algorithms available. Build a tree using the neighbour-joining algorithm.

# build tree with neighbour-joining method
tre <- nj(D)
# trees which are generated using ape have class 'phylo'
class(tre)
## [1] "phylo"
summary(tre)
## 
## Phylogenetic tree: tre 
## 
##   Number of tips: 55 
##   Number of nodes: 53 
##   Branch lengths:
##     mean: 0.006456408 
##     variance: 4.723129e-05 
##     distribution summary:
##         Min.      1st Qu.       Median      3rd Qu.         Max. 
## 9.240982e-05 9.925795e-04 3.073291e-03 1.040831e-02 3.841085e-02 
##   No root edge.
##   First ten tip labels: G.CD.2008.DRC319.MN178973 
##                         G.CD.2008.DRC378.MN179012
##                         G.CD.2008.DRC358.MN178994
##                         G.CD.2008.DRC685.MN179134
##                         G.CD.2008.DRC341.MN178978
##                         G.CD.2008.DRC752.MN179167
##                         G.CD.2008.DRC848.MN179213
##                         02_AG.CD.2008.DRC828.MN179202
##                         G.CD.2008.DRC117.MN178943
##                         G.CD.2008.DRC589.MN179099
##   No node labels.

Plot the tree. What do you notice? Which tips share a common ancestor? Do any of the tree features suggest something has gone wrong with the alignment or phylogenetic inference?

# map the taxa labels to colours based on their HIV subtype (for plotting)
select <- data.table(taxa=labels(dat))
select[, ST:=gsub('([A-Za-z0-9_]+).*','\\1',taxa)]
mypal <- pal_npg('nrc')(4)
select[, ST:= factor(ST)]

colmap <- data.table(ST=c('G','02_AG','A','Ref'),
                     color=mypal)
select <- merge(select,colmap,by='ST')
cols <- with(select, as.character(color[match(labels(dat), taxa)]))

# plot tree
plot(tre, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("Tree using neighbour-joining algorithm")

Correct the alignment

See if you can find the reason for the long branch by taking a look at the alignment.

Open the .fasta file with an alignment viewer such as AliView jalview and see if you can find and correct the alignment error. Hint: the sequences in the alignment must all be the same length.

Export the amended alignment and save it with a different name in .fasta format.

Read in the new .fasta file and repeat the steps above using the following code. Now the distance matrix no longer shows large differences between all sequences and the misaligned sequence. The long branch has also gone from the tree.

What else do you notice? Hint: the tips of the tree have been coloured according to their HIV-1 subtype.

filename <- file.path(data.dir,'faria_2019_fixedalignment.fasta')

dat <- read.dna(file = filename, format = "fasta")

D <- dist.dna(dat, model = "TN93")

# plot distance matrix
temp <- as.data.frame(as.matrix(D))
temp <- t(as.matrix(D))
temp <- temp[, ncol(temp):1]
par(mar = c(1, 5, 5, 1))

# melt the matrix to a data.table 
tmp <- data.table(reshape2::melt(temp))
# ensure the ordering of taxa are the same for plotting the matrix
tmp[, Var2:= factor(Var2,levels=levels(Var1))]

ggplot(tmp) + geom_tile(aes(x=Var1,y=Var2,fill=value)) +
  scale_fill_viridis(option="magma",direction=-1) + 
  labs(x='',y='',fill='Genetic distances\n(Tamura and Nei model)') +
  theme_bw(base_size=6) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

select <- data.table(taxa=labels(dat))
select[, ST:=gsub('([A-Za-z0-9_]+).*','\\1',taxa)]
mypal <- pal_npg('nrc')(4)
select[, ST:= factor(ST)]

colmap <- data.table(ST=c('G','02_AG','A','Ref'),
                     color=mypal)
select <- merge(select,colmap,by='ST')
cols <- with(select, as.character(color[match(labels(dat), taxa)]))

# build tree
tre <- nj(D)

plot(tre, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("Tree using neighbour-joining algorithm")

We notice that there is a recombinant of subtypes A and G (light blue tip). The algorithm doesn’t know where to put it in the tree, and we have a clade of both subtype A and subtype G viruses, whereas we’d expect the subtype A viruses to share a most recent common ancestor with all other subtype A sequences, and same for subtype G. What happens if we remove it?

Hint: the taxa labels for the alignment are stored in labels(dat).

Note that it is not necessary to exclude recombinants from phylogenetic analyses as a rule. However we have to be careful how we handle them. If we are interested in a particular recombinant circulating in the population, we might generate a separate phylogeny for just that recombinant.

Remove the recombinant

# find recombinant taxa label which isn't subtype G or A
to_remove <- labels(dat)[!gsub('([A-Za-z0-9_]+).*','\\1',labels(dat)) %in% c('G','A','Ref')]

# remove recombinant
dat <- dat[!labels(dat)==to_remove,]

# tip colours
cols <- with(select, as.character(color[match(labels(dat), taxa)]))

# recompute distance matrix
D <- dist.dna(dat, model = "TN93")

# generate tree
tre <- nj(D)

# plot tree
plot(tre, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("Tree using neighbour-joining algorithm")

Root the tree

The reference sequence (subtype C) is part of the G clade, and we still have some subtype A sequences in a clade with subtype G sequences. In general, the topology can be misleading if the tree is not rooted, especially with respect to isolates. We can tell the algorithm where the root should be by specifying the outgroup (the sequence beginning with Ref.).

We can root the tree using the command root().

# find the tip label containing 'Ref'
lab.root <- labels(dat)[grep('Ref',labels(dat))]

# re-root the tree
tre2 <- root(tre, outgroup = lab.root, resolve.root=TRUE)
plot(tre2, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("Rooted tree using neighbour-joining algorithm")

Now by rooting the tree, the reference sequence has moved to be the ancestor of all the taxa. We now also see that all the subtype G taxa are in one clade, and all the subtype A taxa are in another.

Extract a clade

There are a few different tools in ape for plotting. See if you can extract the G clade and plot it. You will need the following functions: fastMRCA() and extract.clade().

node <- fastMRCA(tre2,"G.CD.2008.DRC319.MN178973",
    "G.CD.2008.DRC358.MN178994")
G.clade <- extract.clade(tre2,node)
plot(G.clade, cex = 0.6, tip.color = mypal[1])
add.scale.bar(x=0,y=0)
title("Subtype G clade")

Bootstrapping

We are interested in seeing how robust our tree is.

Let’s bootstrap resample each site of the alignment with replacement, carry out phylogenetic inference as before, and plot the tree with internal nodes labelled with the number of bootstrapped trees in which we observe this internal node. If the tree is robust, we would hope to observe the same internal nodes across the original tree and the bootstrap trees.

# bootstrap 100 times
myBoots <- boot.phylo(tre2, dat, function(e) nj(dist.dna(e, model = "TN93")),B=100)
## 
Running bootstraps:       100 / 100
## Calculating bootstrap values... done.
plot(tre2, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("NJ tree + bootstrap values")
nodelabels(myBoots, cex = 0.5)

What do you notice about the values?

Some of the bootstrap values at the internal nodes are quite low, which means some of these nodes are not well supported.

Re-build tree using maximum likelihood estimation

We will finally re-estimate the tree using a maximum-likelihood approach (Felsenstein, 1981), which uses the sequence data in a statistical framework where we seek to estimate model parameters. There other approaches to building trees we will not cover here, such as maximum parsimony, but you can explore these yourself.

We will use the phangorn package, which requires the data to be in a different format, phyDat.

We first compute the likelihood of the data using the neighbour-joining tree using the command pml().

# convert the alignment to PhyDat using phangorn package
dat2 <- as.phyDat(dat)
dat2
## 54 sequences with 1080 character and 482 different site patterns.
## The states are a c g t
tre2.lik <- pml(tre2, dat2)
tre2.lik
## model: JC 
## loglikelihood: -14305.83 
## unconstrained loglikelihood: -4573.897 
## 
## Rate matrix:
##   a c g t
## a 0 1 1 1
## c 1 0 1 1
## g 1 1 0 1
## t 1 1 1 0
## 
## Base frequencies:  
##    a    c    g    t 
## 0.25 0.25 0.25 0.25

We next optimise, where the algorithm searches the tree topology space to find the best tree for the chosen substitution rate model, optimising tree topology (optNni), base frequencies (optBf) and substitution rates (optQ).

opt.tre <- optim.pml(tre2.lik, optNni = TRUE, optBf = TRUE, optQ = TRUE,
 optGamma = TRUE)
## Warning: I unrooted the tree
## only one rate class, ignored optGamma
## optimize edge weights:  -14305.83 --> -13201.64 
## optimize base frequencies:  -13201.64 --> -13088.94 
## optimize rate matrix:  -13088.94 --> -12309.41 
## optimize edge weights:  -12309.41 --> -12305.71 
##  optimize topology:  -12305.71 --> -12211.99 
##  NNI moves:  9 
##  optimize topology:  -12211.99 --> -12175.43 
##  NNI moves:  6 
##  optimize topology:  -12175.43 --> -12161.79 
##  NNI moves:  4 
##  optimize topology:  -12161.79 --> -12132.25 
##  NNI moves:  5 
##  optimize topology:  -12132.25 --> -12120.07 
##  NNI moves:  5 
## NNI moves:  29 
## optimize base frequencies:  -12120.07 --> -12108.98 
## optimize rate matrix:  -12108.98 --> -12107.68 
## optimize edge weights:  -12107.68 --> -12107.65 
##  optimize topology:  -12107.65 --> -12101.21 
##  NNI moves:  4 
##  optimize topology:  -12101.21 --> -12087.42 
##  NNI moves:  4 
##  optimize topology:  -12087.42 --> -12072.41 
##  NNI moves:  4 
##  optimize topology:  -12072.41 --> -12057.99 
##  NNI moves:  4 
##  optimize topology:  -12057.99 --> -12057.02 
##  NNI moves:  2 
## NNI moves:  18 
## optimize base frequencies:  -12057.02 --> -12056.34 
## optimize rate matrix:  -12056.34 --> -12056.05 
## optimize edge weights:  -12056.05 --> -12056.05 
##  optimize topology:  -12056.05 --> -12050.45 
##  NNI moves:  1 
##  optimize topology:  -12050.45 --> -12050.45 
##  NNI moves:  0 
## NNI moves:  1 
## optimize base frequencies:  -12050.45 --> -12050.33 
## optimize rate matrix:  -12050.33 --> -12050.29 
## optimize edge weights:  -12050.29 --> -12050.29 
##  optimize topology:  -12050.29 --> -12050.29 
##  NNI moves:  0 
## NNI moves:  0 
## optimize base frequencies:  -12050.29 --> -12050.26 
## optimize rate matrix:  -12050.26 --> -12050.25 
## optimize edge weights:  -12050.25 --> -12050.25 
## optimize base frequencies:  -12050.25 --> -12050.25 
## optimize rate matrix:  -12050.25 --> -12050.25 
## optimize edge weights:  -12050.25 --> -12050.25 
## optimize base frequencies:  -12050.25 --> -12050.24 
## optimize rate matrix:  -12050.24 --> -12050.24 
## optimize edge weights:  -12050.24 --> -12050.24 
## optimize base frequencies:  -12050.24 --> -12050.24 
## optimize rate matrix:  -12050.24 --> -12050.24 
## optimize edge weights:  -12050.24 --> -12050.24 
## optimize base frequencies:  -12050.24 --> -12050.24 
## optimize rate matrix:  -12050.24 --> -12050.24 
## optimize edge weights:  -12050.24 --> -12050.24 
## optimize base frequencies:  -12050.24 --> -12050.24 
## optimize rate matrix:  -12050.24 --> -12050.24 
## optimize edge weights:  -12050.24 --> -12050.24
opt.tre
## model: F81 
## loglikelihood: -12050.24 
## unconstrained loglikelihood: -4573.897 
## 
## Rate matrix:
##          a         c        g         t
## a 0.000000  2.406060 8.859153  1.189172
## c 2.406060  0.000000 1.038145 10.355152
## g 8.859153  1.038145 0.000000  1.000000
## t 1.189172 10.355152 1.000000  0.000000
## 
## Base frequencies:  
##         a         c         g         t 
## 0.3659051 0.1775044 0.2214021 0.2351884

We could also optimise for a given substitution model, with the option: model = ‘JC’ or ‘GTR’.

We can then compare the optimal tree with the initial tree with a likelihood ratio test and confirm that it has a substantially larger likelihood and a lower AIC, indicating the optimised tree is better model for the sequence data.

anova(tre2.lik, opt.tre)
## Likelihood Ratio Test Table
##   Log lik.  Df Df change Diff log lik. Pr(>|Chi|)    
## 1   -14306 106                                       
## 2   -12050 113         7        4511.2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can plot it again,

opt.tre.root <- root(opt.tre$tree, outgroup = lab.root, resolve.root=TRUE)
plot(opt.tre.root, cex = 0.6, tip.color = cols)
add.scale.bar(x=0,y=0)
title("Rooted maximum likelihood tree")

We can also re-compute the bootstrap values for internal nodes.

bs <- bootstrap.pml(opt.tre, bs=100, optNni=TRUE, multicore=TRUE, control = pml.control(trace=0))
plotBS(midpoint(opt.tre$tree), bs, type="p",cex = 0.6, tip.color = cols)
add.scale.bar()
title("ML tree")